Machine learning to analyse omic-data for COVID-19 diagnosis and prognosis

Background
With the global spread of COVID-19, the world has seen many patients, including many severe cases. Rapid developments in machine learning (ML) have produced significant achievements in disease diagnosis and prediction. Current studies have confirmed that omics data at the host level can reflect the development process and prognosis of the disease. Since early diagnosis and effective treatment of severe COVID-19 patients remain challenging, this research aims to use omics data in different ML models for COVID-19 diagnosis and prognosis. We applied several ML models to omics data from a large number of individuals, first to predict whether patients are COVID-19 positive or negative, and then to predict the severity of the disease.

Results
On the COVID-19 diagnosis task, we obtained the best AUC of 0.99 with our multilayer perceptron model and the highest F1-score of 0.95 with our logistic regression (LR) model. For the severity prediction task, we achieved the highest accuracy of 0.76 with an LR model. Beyond classification and predictive modeling, our study found that ML models performed better on integrated multi-omics data than on single-omics data. By comparing top features from different omics datasets, we also found that our model is robust, with a wide range of applicability across diverse COVID-19-related datasets. Additionally, we found that omics-based models performed better than image-based or physiological feature-based models, which underscores the importance of omics-based datasets for future model development.

Conclusions
This study identifies COVID-19-positive cases and predicts their severity levels. By leveraging state-of-the-art models, it lowers the dependence on clinical data and professional judgment. Our model showed wide applicability across different omics datasets, making it highly transferable to other respiratory or similar diseases. With such models, hospitals and public health care mechanisms can optimize the distribution of medical resources and improve the robustness of the medical system.

The dataset used the COVID-GRAM risk score for risk assessment, to test whether disease severity is related to DNA-methylation patterns in blood. The COVID-GRAM risk score is synthesised from a series of clinical data and reflects the patient's overall situation and disease progression. It provides an estimate of the risk of critical illness for COVID-19 inpatients and helps to identify patients who may subsequently develop a critical illness. When the score is larger than a threshold, the patient is defined as a high-risk case. The COVID-GRAM risk score evaluates only patients with COVID-19 and does not apply to patients without COVID-19.
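The labelling rule described above can be sketched in a few lines of Python. This is only an illustration: the text does not state the cutoff, so `HIGH_RISK_THRESHOLD` is a placeholder value, and the function name `risk_group` and its return labels are hypothetical.

```python
from typing import Optional

# Placeholder cutoff: the text only says "a threshold", so this value is an
# assumption made for illustration, not the one used in the dataset.
HIGH_RISK_THRESHOLD = 0.5


def risk_group(
    gram_score: Optional[float],
    covid_positive: bool,
    threshold: float = HIGH_RISK_THRESHOLD,
) -> Optional[str]:
    """Return "high" or "low", or None when the score does not apply."""
    # The score is defined only for COVID-19 patients.
    if not covid_positive or gram_score is None:
        return None
    # A patient is high risk when the score is strictly larger than the threshold.
    return "high" if gram_score > threshold else "low"
```

Returning `None` for COVID-19-negative individuals keeps them out of the severity labels instead of silently assigning them to the low-risk group.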

RNA-seq
We collected an RNA-seq dataset from a whole-blood study of severe COVID-19 patients and healthy blood donors (NCBI: GSE171110), together with the clinical information records of the cases [3]. Whole blood is a body fluid composed of blood cells and plasma. This dataset was used to analyse whole-blood gene expression. Samples were collected from 54 individuals in total, of whom 10 were healthy and 44 had severe COVID-19.

Proteomic and metabolomic
We obtained proteomics and metabolomics of blood serum from a group of COVID-19 patients from an open-source study [4,5]. The data was collected from 65 patients who visited Taizhou Hospital between January and March 2020.
The serum samples were collected and verified as COVID-19 positive or negative using PCR tests. For the positive samples, the study defined four stages of clinical characteristics: mild, typical, severe, and critical. From the total samples, we obtained a complete set of proteomic and metabolomic samples. It contained a protein matrix (791 features) and a metabolite matrix (847 features) from 31 COVID-19 patients (18 non-severe and 13 severe), with detailed patient descriptions.

Multi-omics dataset
For the subgroup division and prediction of COVID-19 severity, we used three datasets: a 4-omics dataset containing transcripts, proteins, metabolites, and lipids, and two independent single-omics datasets for validation (proteomic and metabolomic). We used the 4-omics data from [6], collected from 128 adult SARS-CoV-2 virus-associated patients at Albany Medical Center in Albany, New York, USA. Blood samples were collected from these patients, and transcripts, proteins, metabolites, and lipids were then measured from plasma. In addition, the authors identified and quantified the abundance of more than [...]

In the four data cohorts, females accounted for 37.3% of the COVID-19 group. The mean age of patients in the COVID-19 group was 61.3 years for males and 63.1 years for females. The COVID-19 group was more ethnically diverse, with white patients accounting for 46% of the total. These demographics are similar to the racial and ethnic health distribution reported by another study of COVID-19 [7]. The dataset is therefore somewhat representative of the proportions of healthy people, COVID-19 patients, and severe COVID-19 patients in society, along with their demographic characteristics and health status.

Validation dataset
Independent datasets were used to validate the predictive performance of the key features derived from the severity subgroup analysis. This validation tests the method's robustness and assesses how well it generalises to a wide range of omic data. Our selection criteria for the independent datasets were therefore that they had to be of the same omic type as the training dataset and be derived from a different sample of cases. Accordingly, we selected the proteomic and metabolomic data from the single-omic COVID-19 diagnosis problem. These data were used to verify the performance of the key proteomic and metabolomic features obtained when training the four-omics model for predicting COVID-19 severity. We trained the model on the proteomic and metabolomic data restricted to the key features from the training set, and verified it on the two validation datasets. Preliminary data processing consisted of two steps. First, we retained in the validation datasets only the key features selected on the training set. Second, to handle missing values, we removed features with more than 20% NA values and filled the remaining missing values with zeros.
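The preprocessing of the validation data can be sketched as follows. This is a minimal, dependency-free illustration rather than the pipeline actually used: the function name `preprocess_validation`, the row-as-dictionary layout, and the order of the three steps are assumptions, and `None` stands in for an NA value.

```python
from typing import Dict, List, Optional

# One sample: feature name -> measured value (None marks an NA value).
Row = Dict[str, Optional[float]]


def preprocess_validation(
    rows: List[Row],
    key_features: List[str],
    max_na_fraction: float = 0.2,
) -> List[Dict[str, float]]:
    """Keep key features, drop those with too many NAs, zero-fill the rest."""
    if not rows:
        return []
    n_samples = len(rows)

    # Steps 1 and 2: consider only the key features selected on the training
    # set, and drop any whose NA fraction is greater than the limit (20%).
    kept = [
        feat
        for feat in key_features
        if sum(r.get(feat) is None for r in rows) / n_samples <= max_na_fraction
    ]

    # Step 3: zero-value padding for the remaining missing entries.
    return [
        {feat: (0.0 if r.get(feat) is None else r[feat]) for feat in kept}
        for r in rows
    ]
```

A feature that is absent from a sample is counted as NA here, which is a design choice of this sketch. A feature with exactly 20% missing values is kept, since the text removes only features with more than 20% NA values.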